Data wrangling

GEOG 30323

October 2, 2018

Data wrangling

In real-world data analysis, your data will likely:

  • Have missing/possibly incorrect values
  • Be in a format unsuitable for data analysis
  • Be spread across multiple files, possibly of different types
  • Need re-shaping or summarization to draw meaningful conclusions

Fortunately, pandas can help you with all of this!

Subsetting

  • Frequently, you’ll have way more data than you need!
  • Datasets can be reduced in size by indexing and subsetting
  • Let’s read in the colleges dataset as a demo

By column name

  • Let’s drop most of the columns in the dataset

By row position

  • Data frames can be sliced like lists and strings

By row or column index

  • Selecting by row or column index available in the .loc[] method (note the brackets)

By value

  • Often, you’ll want to keep rows that have a certain column value, or exclude rows based on that value
  • Statements use conditional operators; can be combined with & (and) and | (or)
  • .isin() method: checks to see if value is in list of values

By value

By value

By value

Creating new columns

  • New columns can be created based on specified values, or as derivatives of other columns, using mathematical operators or the .assign() method
  • Let’s demo with a simulated data frame:

Creating new columns

dtype conversion

  • To do numerical analysis, our numeric data have to be stored as numbers!
  • To convert: use the .astype() method
  • Note: SettingWithCopyWarning

Missing data

  • Commonly, all of the data you need will not be found in your data set!
  • Possible solutions:
    • Delete all rows that have missing data
    • Fill in missing data with a specified value
    • Interpolate missing values

Missing data

  • .dropna() method: delete all rows (or columns) that have any missing values (NaN in pandas)

Missing data

  • .fillna() method: fill in missing data with a specified value

Group-wise data analysis

  • Thus far, we’ve focused on characteristics of data within a particular group
  • Common question: how do characteristics vary by group?
  • In pandas: .groupby() method!

Split-apply-combine

  • Wickham (2011): the “split-apply-combine” model of data analysis

Process:

  • Data are split by some characteristic into groups
  • We apply a function to each of the groups
  • The resultant data are combined back into a single dataset

.groupby() in pandas

Grouped visualization in seaborn

Grouped visualization in seaborn

  • Faceting or small multiples: breaking down a plot by a grouping variable into multiple plots

Grouped visualization in seaborn

  • We can also get creative with our code!

Merging data

  • Commonly, you’ll have data in two - or multiple! - datasets that you’ll want to combine into one
  • Simulated data:

The .merge() method in pandas

Types of merges in pandas

  • Options for merging (the how parameter): 'inner' (default), 'left', 'right', and 'outer'
  • Simulated data:

Inner merges

Left merges

Right merges

Outer merges

The “shape” of data

  • Long (“tidy”) data:
    • Each variable forms a column;
    • Each observation forms a row;
    • Each type of observational unit forms a table
  • Wide data: column headers represent values, not variable names

Example: World Bank data

  • Long format:

Long to wide

  • .pivot() method in pandas

Plotting “wide” data

Wide to long

  • pd.melt() function in pandas

Plotting long-form data